Chromosome-level genome assembly of Microplitis manilae Ashmead, 1904 (Hymenoptera: Braconidae)

Microplitis manilae Ashmead (Hymenoptera: Braconidae) is an important parasitoid of agricultural pests in lepidopteran species. So far, two extant genome assembles from the genus Microplitis are fragmented. Here, we offered a high-quality genome assembly of M. manilae at the chromosome level with high accuracy and contiguity, assembled by ONT long-read, MGI-SEQ short-read, and Hi-C sequencing methods. The final assembled genome size was 282.85 Mb, with 268.17 Mb assigned to 11 pseudochromosomes. The scaffold N50 length was 25.23 Mb, and the complete BUSCO score was 98.61%. The genome contained 152.37 Mb of repetitive elements, representing 53.87% of the total genome size. We predicted 15,689 protein-coding genes, of which 13,580 genes were annotated functionally. Gene family evolution investigations of M. manilae revealed 615 expanded and 635 contracted gene families. The high-quality genome of M. manilae reported in this paper will be a useful genomic resource for research on parasitoid wasps in the future.


Background & Summary
Microplitis manilae Ahmead (Hymenoptera: Braconidae: Microgastrinae) is a solitary endoparasitoid wasp and is primarily distributed in the Asia-Pacific region 1 . It attacks several lepidopteran species, with Spodoptera species being its preferred target, including S. frugiperda, S. exigua and S. litura 2 , which of them are the world's most significant agricultural pests 3 . M. manilae is thought to be an ideal biological control agent for Spodoptera spp.
So far, it has approximately 200 species have been recognized within Microplitis 1 , and some of them, i.e. M. croceipes, M. demolitor and M. mediator, have been widely used in biological pest control [4][5][6] . The virulence factors of Microplitis wasps that act to suppress or circumvent host immunity are primarily composed of polydnaviruses (PDV), venom, and teratocytes 7,8 . In recent years, the biology, ecology, and interaction with the host of Microplitis have been studied 9,10 . The study of the interactions between parasitoid wasps and their host insects, particularly the regulation of host immunity and development by parasitoid wasps, has great potential for increasing the use of parasitoid wasps in sustainable pest management in agriculture. To further understand the complex relationship between parasitoids and their hosts, high quality genome data would play an important role. The genome at the chromosome level may shed light on the evolution of parasites, the mechanisms of parasitism, the potential for developing new techniques for biological control and utilizing natural enemies as resources. However, only two fragmented genomes from the genus Microplitis (M. demolitor and M. mediator) are currently available in NCBI and a chromosome-level genome assembly for Microplitis spp. has not been published.
In this study, we used MGI short-read, ONT long-read and Hi-C sequencing technologies to assemble the M. manilae chromosome-level genome. The final genome size was 282.85 Mb with a scaffold N50 length of 25.23 Mb, and 268.17 Mb assembled genome sequences were successfully anchored on 11 chromosomes. In total, 15,689 protein-coding genes were identified, and 13,580 of them were functionally annotated. Sequencing. The extraction of DNA and RNA was performed on newly emerged male individuals that had been raised for five or more generations. Genomic DNA was obtained using the Blood & Cell Culture DNA Mini Kit (Qiagen, Hilden, Germany) for both long-read and short-read whole genome sequencing. RNA was isolated using the TRlzol reagent (Vazyme, Nanjing, China). The Hi-C library was generated using the restriction endonuclease DpnII. Long-read sequencing was carried out using the Nanopore PromethION platform (Oxford Nanopore Technologies, UK), with an insert size of approximately 20 kb. Short-read and transcriptome sequencing were performed using libraries with an insert size of 350 bp and sequenced on the MGISEQ. 2000 platform. The total data generated from the long-read sequencing was 76.31 Gb, while the total data generated from the short-read sequencing was 82.60 Gb (Table 1).
Genome size estimation and assembly. The raw reads obtained from the MGISEQ. 2000 platform were subjected to quality control using fastp v0.21.0 11 to filter adapter sequences and low-quality reads. The remaining reads of MGI library were then used to estimate the genome size of M. manilae by GenomeScope v1.0.0 12 and analyze the 17-mer distribution with Jellyfish v2.3.0 13 . The final genome size was estimated to be 297.29 Mb through K-mer analysis.
The draft genome is obtained by first assembling long reads and then polishing the results with short reads, which has been widely used in genome assembly research for different organisms recently [14][15][16][17][18] . NextDenovo v2.5.0 (https://github.com/Nextomics/NextDenovo) was used to assemble the initial assembly with ONT sequences. NextPolish v1.4.0 19 was then applied to polish the draft genome assembly using MGISEQ sequences. Juicer v1.6.2 20 was used to align Hi-C reads to the draft assembly and subject them to quality control. 3D-DNA 21 was used to anchor primary contigs into chromosomes, then corrected the possible errors manually with Juicebox v1.11.08 22    The genome completeness was evaluated with the BUSCO v4.1.4 pipeline 23 , searching against the insect_ odb10 database 24 . The analysis identified 98.61% (single-copied genes: 97.88%, duplicated genes: 0.73%), 0.44%, and 0.95% of the 1,367 predicted genes in this genome as complete, fragmented, and missing sequences, respectively. These results suggested the assembled genome is highly complete. Genome annotation. The genome of M. manilae was annotated for repetitive elements, non-coding RNAs (ncRNAs), and protein-coding genes (PCGs). The Extensive de novo TE Annotator (EDTA) pipeline 25 was used to build TE libraries for repeat annotation initially. Non-LTR retrotransposons and any unclassified TEs missed by the TE annotators mentioned above were then identified by RepeatModeler v2.0.2 26 . A comprehensive non-redundant TE library was generated combining with above results and Dfam3.2 27 . RepeatMasker v4.1.2 (http://www.repeatmasker.org) was subsequently used to search for known and novel TEs. In the genomic sequences, a total of 152.37 Mb repetitive elements were identified, constituting 53.87% of the total. The most abundant repeating element was DNA transposons (13.54%), followed by long terminal repeats (LTR, 10.43%) and long interspersed nuclear elements (LINEs, 1.75%), while unclassified repeats made up 27.43% of the total (Table 3, Fig. 2). Infernal 1.1.2 28 was used to identify rRNAs, snRNAs, and miRNAs based on the alignment with the Rfam library 29 . tRNAscan-SE v2.0.6 30 was used to predict tRNAs. Finally, 1,894 noncoding RNAs were predicted, including 1,269 transfer RNAs (tRNAs), 194 ribosomal RNAs (rRNAs), 74 micro-RNAs (miRNAs), 63 small nuclear RNAs (snRNAs), and 294 others (Table S1, Fig. 2). www.nature.com/scientificdata www.nature.com/scientificdata/ Three different strategies were applied for the annotation of PCGs: transcriptome-based prediction, de novo gene prediction, and homology-based prediction. In transcriptome-based prediction, the transcriptome was assembled from RNA-seq alignments by HISAT2 v2.2.1 31  www.nature.com/scientificdata www.nature.com/scientificdata/ PASA pipeline v2.4.1 (https://github.com/PASApipeline/PASApipeline). The repeat-masked genome was analyzed using AUGUSTUS v3.3.3 32 and SNAP v2006-07-28 33 for de novo gene prediction. The protein sequences of hymenopteran species were downloaded from the NCBI Database as references for homology-based prediction. Exonerate v2.4.0 34 was utilized to align the reference proteins to the genome assembly and predict gene structures. Finally, a consensus gene set was created by integrating the genes predicted by the aforementioned three methods using EVidenceModeler v1.1.1 35 . We predicted 15,689 protein-coding genes for the M. manilae genome by combining the evidences from the transcriptome, ab initio, and homology-based predictions. The average length of the predicted gene was 8,718 base pairs, while that of a protein-coding region was 1,575 bp. Exon and intron lengths on average were 319 and 1,814 bp, respectively. There were 4.9 exons on average per gene (Table 4).

Data records
The MGI, ONT, RNA-seq and Hi-C sequencing data used for the genome assembly have been deposited in the NCBI Sequence Read Archive (SRA) database with accession numbers SRR21358828 46 , SRR21358827 47 , SRR21358829 48 and SRR21358826 49 , respectively, under the BioProject accession number PRJNA872950. The chromosomal assembly has been deposited at GenBank with accession number JAPFQK000000000 50 . Genome annotation information has been deposited in the Figshare database 51 .

Technical Validation
Evaluating the quality of the genome assembly. The quality of M. manilae genome assembly was evaluated using two approaches. Firstly, sequencing data were mapped to the genome to verify the accuracy, yielding mapping rates of 99.52% for MGI, 94.40% for RNA-seq, and 98.52% for ONT data. Secondly, BUSCO analysis found 98.6% of the 1,367 single-copy orthologues (in the insects_odb10 database) to be complete (97.9% single-copied genes and 0.7% duplicated genes), 0.4% fragmented, and 1.0% missing.
Chromosome synteny. Chromosome synteny between M. manilae and Cotesia congregata was detected by MCScanX 52 with default parameters. The genome assembly of C. congregata 53 was retrieved from NCBI with accession number GCA_905319865.3. The visual diagram was generated using TBtools 54 . The synteny of the M. manilae assembly was compared to that of C. congregata, a closely related species of the subfamily Microgastrinae. The results showed a low level of synteny between M. manilae and C. congregata (Fig. 3). A number of fusion and fission events were detected between these two wasps. For instance, Chr11 and a part of Chr5 of M. manilae were syntenic to Chr4 of C. congregata, whereas Chr1 of M. manilae was syntenic to a portion of Chr2 and Chr3 of C. congregata. Low genome synteny was also identified between Nasonia vitripennis and Pteromalus puparum, both of which are members of the same family Pteromalidae 55 . www.nature.com/scientificdata www.nature.com/scientificdata/ Nasonia vitripennis, Orussus abietinus, Polistes dominula, and Venturia canescens (Table S2). A total of 132,122 genes were assigned to 12,544 gene families. Among them, 4,910 gene families were presented in all the species genomes, with 3,780 single-copy and 1,130 multicopy gene families. In the 15,689 predicted genes of M. manilae, 14,822 (94.47%) were grouped into 9,725 families. There were 1,295 genes in 241 families unique to M. manilae (Fig. 4, Table S3).

Gene annotation validation.
All single-copy protein sequences were concatenated into one data matrix after being aligned with MAFFT v7.427 57 . The phylogenetic tree was constructed using IQ-TREE v2.0.5 58 Fig. 4 Distribution of genes in different Hymenoptera species. "1:1:1" represents shared single-copy genes, "N:N:N" as multicopy genes shared by all species, "others" as unclassified orthologs, "unassigned" as orthologs which cannot be assigned into any gene families (orthogroups).  www.nature.com/scientificdata www.nature.com/scientificdata/ values. The topology of the phylogeny was consistent with that of the previous study 61 . The MCMCTree package in PAML v4.9j 62 was used to estimate divergence times. Based on a previous study, five calibration time points were used: root holometabolous: <300 million years ago (mya); Orussoidea + Apocrita: 211-289 mya; Apocrita: 203-276 mya; Aculeata: 160-224 mya; and Ichneumonoidea: 151-218 mya 61 . As expected, our analysis revealed that M. manilae was closely related to M. demolitor and these two species diverged approximately 7.6 mya (Fig. 5). CAFE v4.2.1 63 was used to estimate gene family expansions and contractions with a p value of 0.01. Finally, we found 615 and 635 gene families experienced expansions and contractions in M. manilae, respectively, and 395 (310 expanded and 85 contracted) of them were rapidly evolved (Fig. 5).

Code availability
This work did not utilize a custom script. Data processing was carried out using the protocols and manuals of the relevant bioinformatics software.